by Zuguang Gu (z.gu@dkfz.de), 2017-12-15 09:16:53. The github repository for this material is at https://github.com/eilslabs/teaching.
On day 1 to day 4, we have tried basic statistical plots for visualizing the results of the analysis. However, for genomic datasets which are always huge and contain a lot of information, there are more advanced visualizations which can give a global, comprehensize and complete view of the underlying data.
In this practice, I will introduce several visualization R packages (all developed by me) which are easy to make advanced visualizations for genomic datasets.
First load all the packages that we will try.
# You can simply copy and paste following lines
install.packages("circlize")
install.packages("RColorBrewer")
# I update these packages quite offen, so you need to install the newest version of these packages
install.packages("http://bioconductor.org/packages/release/bioc/src/contrib/ComplexHeatmap_1.17.1.tar.gz", repo = NULL)
install.packages("http://bioconductor.org/packages/release/bioc/src/contrib/EnrichedHeatmap_1.9.2.tar.gz", repo = NULL)
install.packages("https://bioconductor.org/packages/release/bioc/src/contrib/gtrellis_1.11.1.tar.gz", repo = NULL)
library(circlize)
library(RColorBrewer)
library(ComplexHeatmap)
library(EnrichedHeatmap)
library(gtrellis)
On day 3 and day 4, we have already tried ComplexHeatmap package to make a single heatmap with column annotations. The ComplexHeatmap has more advanced functionalities which are:
You can go to http://zuguang.de/supplementary/ComplexHeatmap-supplementary1-4/ for examples on real-world datasets. The documentation of the package is avaiable at http://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html#bioc_citation.
First we demonstrate the ComplexHeatmap package by visualizing a public single cell RNASeq data.
download.file("https://eilslabs.github.io/teaching/scrnaseq_top_500_genes.RData", "scrnaseq_top_500_genes.RData")
load("scrnaseq_top_500_genes.RData")
mat_scaled = t(scale(t(mat)))
base_mean = rowMeans(mat)
mat_cor = cor(t(mat))
cc_gene = rownames(mat)[ccl]
rpl = rpl + 0 # just to convert logical values to 0/1
ccl = ccl + 0 # just to convert logical values to 0/1
Complex visualization always replys on complex data processing. In all examples in this practice, the data has already been processed and we only focus on the visualization part.
For the single cell RNASeq dataset, there are following variables:
mat: an expression matrix for mouse with 500 genes and 81 cells. The 500 genes are selected by an algorithm that they can best separate cells. This is test data is from Buettner et al., 2015.mat_scaled: same as mat, but rows are scaled,base_mean: row means of mat, the expression level of each genemat_cor: pair-wise correlation between genesrpl: a vector of value 0 or 1 where 1 means the gene is a ribonucleo protein geneccl: a vector of value 0 or 1 where 1 means the gene is a cell cycle genecc_gene: a character vector of cell cycle genes.If you are not sure of what kind of data stored in these variables, you can use head() function to print the top 6 elements in these variables.
Please note for all above variables, the i^th row in the matrices or the i^th element in the vector always corresponds to the same gene.
In the visualization, we want to show how the cells are separated and whether the genes are ribonucleo protein genes or cell cycle genes.
Colors are important graphical attributes for heatmap, when the matrix is a numeric matrix, the color mapping should be “continuous”. To make such color mapping, we need to generate a color mapping function which can be applied to the numeric matrix to generate corresponding colors.
Since there are two numeric matrice which are mat_scaled and mat_cor, we define two color mapping functions by the colorRamp2() function. colorRamp2() function needs two arguments where the first is the break points and the second is the corresponding colors, then colors for other values can be automatically calculated by the color mapping function.
Take expr_col_fun for example, it defines blue for -1.5, white for 0 and red for 1.5, then values between -1.5 and 0 will be assigned to light blues.
expr_col_fun = colorRamp2(c(-1.5, 0, 1.5),
c("blue", "white", "red"))
cor_col_fun = colorRamp2(c(-1, 0, 1),
c("green", "white", "red"))
Following code generates a list of heatmaps which are for:
All heatmaps are concatenated by the “+” operator.
Note the 5th one is actually a row annotation constructed by rowAnnotation(). rowAnnotation() supports many types of annotation graphics such as barplots, point plots…
All the arguments in Heatmap() function are self-explained according to the argument names.
ht_list = Heatmap(mat_scaled,
name = "scaled_expr",
col = expr_col_fun,
show_row_names = FALSE,
column_title = "relative expression",
show_column_names = FALSE,
width = unit(8, "cm")) +
Heatmap(base_mean,
name = "base_expr",
show_row_names = FALSE,
width = unit(5, "mm")) +
Heatmap(rpl,
name = "ribonucleoprotein",
col = c("0" = "white", "1" = "purple"),
show_heatmap_legend = FALSE,
width = unit(5, "mm")) +
Heatmap(ccl,
name = "cell_cycle",
col = c("0" = "white", "1" = "red"),
show_heatmap_legend = FALSE,
width = unit(5, "mm")) +
rowAnnotation(link = row_anno_link(
at = which(ccl & base_mean > quantile(base_mean, 0.5)),
labels = cc_gene,
labels_gp = gpar(fontsize = 8), padding = 0.5),
width = unit(1, "cm") + max_text_width(cc_gene, gp = gpar(fontsize = 8))) +
Heatmap(mat_cor,
name = "cor",
col = cor_col_fun,
show_row_names = FALSE,
show_column_names = FALSE,
row_dend_side = "right",
show_column_dend = FALSE,
column_title = "pairwise correlation between genes")
To make the plot, use draw() function. Here we set “main_heatmap” to the correlation heatmap by specify its “name”.
draw(ht_list, main_heatmap = "cor")
Now from the heatmaps, we can make following conclusions:
The next example is an visualization on the associations between DNA methylation and gene expression.
download.file("https://eilslabs.github.io/teaching/meth_expr_integrative.RData", "meth_expr_integrative.RData")
load("meth_expr_integrative.RData")
In this example, data is randomly generated based on patterns found in an unpublished analysis. There are following variables in this data:
type: the label which shows whether the sample is tumor or normal.mat_meth: a matrix in which rows correspond to differetially methylated regions (DMRs). The value in the matrix is the mean methylation level in the DMR in every sample.mat_expr: a matrix in which rows correspond to genes which are associated to the DMRs (i.e. the nearest gene to the DMR). The value in the matrix is the expression level for each gene in each sample. Expression is scaled for every gene across samples.direction: direction of the methylation change (hyper meaning higher methylation in tumor samples, hypo means lower methylation in tumor samples).cor_pvalue: p-value for the correlation test between methylation and expression of the associated gene.gene_type: type of the genes (e.g. protein coding genes or lincRNAs).anno_gene: annotation to the gene models (intergenic, intragenic or TSS).dist: distance from DMRs to TSS of the assiciated genes.anno_enhancer: fraction of the DMR that overlaps enhancers.Again, if you don’t know what values are stored in these variables, just use head() function to check.
Since mat_meth and mat_expr have same set of samples, we need to make sure the column order is same for both matrices. We cluster columns in mat_meth and use this column orders as the order of the expression matrix:
Here hclust() which stands for hierarchical clustering need a distance matrix that can be calculated by dist() function and in the results returned by hclust(), the order element contains the order after hierarchical clustering. This is what the following code means.
column_order = hclust(dist(t(mat_meth)))$order
Similar we define two same column annotations, one for the methylation matrix and one for the expression matrix. You can use the variable ha for both matrices, but the legend for this annotation will be repeated twice.
ha = HeatmapAnnotation(df = data.frame(type = type),
col = list(type = c("Tumor" = "pink", "Control" = "royalblue")))
ha2 = HeatmapAnnotation(df = data.frame(type = type),
col = list(type = c("Tumor" = "pink", "Control" = "royalblue")),
show_legend = FALSE)
Now we make the list of heatmaps:
ht_list = Heatmap(mat_meth,
name = "methylation",
col = colorRamp2(c(0, 0.5, 1), c("blue", "white", "red")),
cluster_columns = FALSE,
column_order = column_order,
column_dend_reorder = FALSE,
top_annotation = ha,
column_title = "Methylation",
column_title_gp = gpar(fontsize = 10),
row_title_gp = gpar(fontsize = 10)) +
Heatmap(direction,
name = "direction",
col = c("hyper" = "red", "hypo" = "blue")) +
Heatmap(mat_expr,
name = "expression",
col = colorRamp2(c(-2, 0, 2), c("green", "white", "red")),
cluster_columns = FALSE,
column_order = column_order,
top_annotation = ha2,
column_title = "Expression",
column_title_gp = gpar(fontsize = 10)) +
Heatmap(cor_pvalue,
name = "-log10(cor_p)",
col = colorRamp2(c(0, 2, 4), c("white", "white", "red"))) +
Heatmap(gene_type,
name = "gene type",
col = structure(brewer.pal(length(unique(gene_type)), "Set3"), names = unique(gene_type))) +
Heatmap(anno_gene,
name = "anno_gene",
col = structure(brewer.pal(length(unique(anno_gene)), "Set1"),
names = unique(anno_gene))) +
Heatmap(dist,
name = "dist_tss",
col = colorRamp2(c(0, 10000), c("black", "white"))) +
Heatmap(anno_enhancer,
name = "anno_enhancer",
col = colorRamp2(c(0, 1), c("white", "orange")),
cluster_columns = FALSE,
column_title = "Enhancer",
column_title_gp = gpar(fontsize = 10))
draw(ht_list,
column_title = "Comprehensive correspondence between methylation, expression and other genomic features",
heatmap_legend_side = "bottom")
What if we split rows into several sub-cluster based on methylation pattern?
draw(ht_list, km = 5,
column_title = "Comprehensive correspondence between methylation, expression and other genomic features",
heatmap_legend_side = "bottom")
When spitting rows by k-means clustering into 5 sub-clusters, we can see the patterns and association more clear that the low methylated regions are enriched at TSS and enhancers while high methylated regions are enriched at intergenic regions.
Exercise Can you construct a heatmap list which only contains the methylation heatmap and the expression heatmap?
Hint: you only need to add the methylation heatmap and the expression heatmap.
Sometimes in the analysis, we are interested in whether genomic signals (e.g. DNA methylation) are enriched at specific genomic targets (e.g. TSS). Here we can use the EnrichedHeatmap package to visualize such enrichment.
In following code, we visualize the enrichment of DNA methylation and H3K4me3 histone modification around gene TSS.
The visualization includes two steps:
normalizeToMatrix() where rows are genes and columns are small windows around TSS. The value in each window are the mean signal for the genomic signals fall in.EnrichedHeatmap() functionload(system.file("extdata", "chr21_test_data.RData", package = "EnrichedHeatmap"))
tss = promoters(genes, upstream = 0, downstream = 1)
mat1 = normalizeToMatrix(H3K4me3, tss,
value_column = "coverage", # which column in `H3K4me3` is the value column
extend = 5000, # extension to upstream and downstream of tss
mean_mode = "w0", # methods to calculate mean signal in each window
w = 50, # window size
keep = c(0, 0.99)) # to adjust outliers
mat2 = normalizeToMatrix(meth, tss,
value_column = "meth",
mean_mode = "absolute",
extend = 5000,
w = 50,
background = NA,
smooth = TRUE) # smooth methylation
Visualize heatmaps for methylation, histone modification as well as gene expression as a list of heatmaps.
meth_col_fun = colorRamp2(c(0, 0.5, 1), c("blue", "white", "red"))
ht_list = EnrichedHeatmap(mat1,
col = c("white", "red"),
name = "H3K4me3") +
EnrichedHeatmap(mat2,
col = meth_col_fun,
name = "methylation") +
Heatmap(log2(rpkm+1),
col = c("white", "orange"), # when we define two colors, one corresponds to the minimal and one corresponds to the maximum
name = "log2(rpkm+1)",
show_row_names = FALSE,
width = unit(5, "mm"))
draw(ht_list)
Generally we can see from the plot that the signal of H3K4me3 and low methylation are enriched a little bit after TSS and high signals correspond to high gene expression.
What if we split the heatmaps into three row clusters based on methylation matrix and use hierarchical clustering to determine the row order?
set.seed(123) # because k-means use random number generations
draw(ht_list, main_heatmap = "methylation", km = 3, cluster_rows = TRUE, show_row_dend = FALSE)
EnrichedHeatmap is good for association analysis for different epigenomic datasets. You can find an very complex visualization at http://bioconductor.org/packages/release/bioc/vignettes/EnrichedHeatmap/inst/doc/roadmap.html#toc_7.
After we have a list of genomic regions, some times we are interested in the distribution of the regions on the chromosomes. gtrellis privides an efficient way to visualize genomic regions by chromosomes. The full documentation of gtrellis is at https://bioconductor.org/packages/release/bioc/vignettes/gtrellis/inst/doc/gtrellis.html.
In following example gwas contains positions of SNPs as well as the p-values for determine the SNPs. Since p-values themselves are small, we use -log10(pvalue) instead.
load(system.file("extdata", "gwasCatalog.RData", package = "gtrellis"))
v = -log10(gwas[, "p-value"])
There can be some p-values which are super small and behaves as outliers. Here the value of outliers are replaced by the 95th percentile.
# remove outliers
q95 = quantile(v, 0.95)
v[v > q95] = q95
In the first try, we initialize the layout as one-row layout.
gtrellis_layout(category = paste0("chr", 1:22),
track_ylim = range(v),
track_ylab = "-log10(p)")
add_points_track(gwas, v, gp = gpar(col = "#00000080"))
First we make the normal plot. From the plot below, basically we can only see there are SNPs that show high significance and on chromosome 6 there exists a cluster where the SNP density are very high.
Next we adjust the layout, also we add another track which shows the number of SNPs in 5MB genomic windows. In the new layout, width for each chromosome is much wider than the previous plot, thus, it shows very clearly for the distribution pattern of highly significant SNPs in the genome (in the previous plot, due to the narrow plotting area for each chromosome, the distribution of SNPs seems random). The additional track gives an exact view that SNP density is dominantly high in a cluster on chromosome 6 and there are also many small hotspot mutation areas spreading the genome.
# how many SNPs in every 5MB window
d = genomicDensity(gwas, 5e6)
d[, 4] = d[, 4] * 5e6
head(d)
## chr start end pct
## 1 chr1 1 5000000 70
## 2 chr1 2500001 7500000 72
## 3 chr1 5000001 10000000 94
## 4 chr1 7500001 12500000 130
## 5 chr1 10000001 15000000 66
## 6 chr1 12500001 17500000 32
gtrellis_layout(nrow = 4,
byrow = FALSE,
n_track = 2,
category = paste0("chr", 1:22),
add_ideogram_track = TRUE,
add_name_track=TRUE,
track_ylim = c(range(v), range(d[, 4])),
track_height = c(2, 1),
track_ylab = c("-log10(p)", "#SNP"))
add_points_track(gwas, v, gp = gpar(col = "#00000080"))
add_lines_track(d, d[, 4], area = TRUE, gp = gpar(fill = "#999999", col = NA))
circlize provides a general and powerful solution for generating circular layout in R. It is greatly used in genomic data visualization. The package has extensive documentations (http://zuguang.de/circlize_book/book/) and examples (http://zuguang.de/circlize/).
circlize can do a lot of cool things. In this practice, we only go some simple examples.
The first example also shows the distribution of differnetially methylated regions on chromosomes. In the dataset, there are two variables DMR_hyper and DMR_hypo which are DMRs show high methylation in treatment samples and show low methylation in treatment samples.
In following code, we creat severa tracks:
DMR_hyper and DMR_hypoDMR_hyperDMR_hypoWhat is “rainfall”? For a set of sorted regions (sort first by chromosomes and then start positions), for region i, the distance to region i-1 and the distance to region i + 1 are calculated and the minimal value is assigned to region i as the distance to neighbouring regions. Then, when this distance value is small, it means this genomic region is close to its neighbouring regions. If we plot the distance (or log(distance)) as y-axes on the plot, then when there is a rain drop in the plot, it means there is a cluster of regions where the neighbouring distance is relatively small.
What is genomic density? For a chromosome, it is segmented by a certain window size (e.g. 10MB). For a specific type of genomic regions, the percent of every 10MB window that is covered by the input regions is calculated. Thus, when the percent value is high, it means the window is more covered by the input region.
You can see very easily where are the hyper-DMR clusters and where are the hypo-DMR clusters.
load(system.file(package = "circlize", "extdata", "DMR.RData"))
circos.initializeWithIdeogram(chromosome.index = paste0("chr", 1:22))
bed_list = list(DMR_hyper, DMR_hypo)
circos.genomicRainfall(bed_list, pch = 16, cex = 0.3, col = c("#FF000080", "#0000FF80"))
circos.genomicDensity(DMR_hyper, col = c("#FF000080"), track.height = 0.1)
circos.genomicDensity(DMR_hypo, col = c("#0000FF80"), track.height = 0.1)
Next we visualize genomic translocations. For the translocation, a position in the genome is connected to other position in the genome (maybe in a different chromosome). In following example, the translocation is randomly generated.
Note the number of rows in bed1 and bed2 should be the same, which means, they should be paired.
bed1 = generateRandomBed(nr = 100)
bed1 = bed1[sample(nrow(bed1), 20), ]
bed1[, 2] = bed1[, 3] # since each end of the translocation in single base
bed1$gene = paste0("gene1_", 1:nrow(bed1))
bed2 = generateRandomBed(nr = 100)
bed2 = bed2[sample(nrow(bed2), 20), ]
bed2[, 2] = bed2[, 3]
bed2$gene = paste0("gene2_", 1:nrow(bed2))
head(bed1)
## chr start end value1 gene
## 61 chr10 124106937 124106937 0.2982125 gene1_1
## 35 chr5 178060274 178060274 -0.1144479 gene1_2
## 105 chrX 129118122 129118122 0.1849080 gene1_3
## 25 chr4 89226810 89226810 -0.5244878 gene1_4
## 64 chr11 80065028 80065028 0.5924653 gene1_5
## 12 chr2 64684380 64684380 0.3424680 gene1_6
head(bed2)
## chr start end value1 gene
## 68 chr12 51460247 51460247 0.1432856 gene2_1
## 12 chr2 111819471 111819471 -0.4614804 gene2_2
## 97 chr21 27533849 27533849 -0.6487213 gene2_3
## 71 chr12 133174128 133174128 0.7179303 gene2_4
## 16 chr2 241483363 241483363 -0.1322976 gene2_5
## 77 chr14 26773683 26773683 -0.3590629 gene2_6
In following, we add the gene labels and the translocations are visualized as circular links.
circos.initializeWithIdeogram(plotType = c("axis", "labels"))
circos.genomicLabels(rbind(bed1, bed2), labels.column = "gene",
side = "outside")
circos.genomicIdeogram()
circos.genomicLink(bed1, bed2,
col = rand_color(nrow(bed1)),
border = NA)
circlize is good at visualizing relations which are encoded as matrice (as correlation matrices). This type of plot is always called the chord diagram.
mat contains chromatin states transitions between two biological conditions. For different regions in chromosomes, they may have different chromatin states (e.g. active transcription states or repressive states). Between different biological conditions, the chromatin states may change for a same region. The most common example is the tss of expressed genes always have active transcription states and in cancer the gene expression might be suppressed and the states for tss will become repressive states. The value in mat is the total bases of the regions that the chromatin states change.
download.file("https://eilslabs.github.io/teaching/chromatin_states_transition.RData", "chromatin_states_transition.RData")
load("chromatin_states_transition.RData")
mat
## C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
## R_1 0 79600 13400 1800 42000 0 47000 1800 14400 2800 1200
## R_2 56400 0 5000 800 8000 400 68400 0 400 400 800
## R_3 0 400 0 1800 200 1000 0 0 0 0 0
## R_4 800 200 200 0 9600 4600 17000 0 0 200 200
## R_5 98200 13000 3000 167200 0 6000 413800 400 2200 400 200
## R_6 0 0 7200 113000 7400 0 1800 0 0 0 0
## R_7 11200 38000 1400 33400 930800 32000 0 200 1000 0 800
## R_8 400 0 400 200 0 0 0 0 12600 0 0
## R_9 2400 0 0 0 4600 0 0 8600 0 0 0
## R_10 0 0 0 0 0 0 0 0 0 0 0
## R_11 0 0 0 0 0 0 0 0 0 0 0
## R_12 0 0 0 0 0 0 0 0 0 0 0
## R_13 5800 600 200 0 800 600 2400 0 3800 800 200
## R_14 11800 200 0 200 22800 200 74200 0 21400 400 0
## R_15 80200 3600 0 5200 213000 0 773800 200 74600 0 0
## C_12 C_13 C_14 C_15
## R_1 200 23200 8800 195000
## R_2 600 3600 1400 13600
## R_3 0 200 0 0
## R_4 0 1800 0 0
## R_5 8000 44000 33800 621400
## R_6 200 0 0 10200
## R_7 9400 25800 42800 987000
## R_8 0 0 0 0
## R_9 0 1200 0 49000
## R_10 0 0 0 0
## R_11 0 0 0 0
## R_12 0 1600 0 0
## R_13 22000 0 12200 8400
## R_14 7200 182600 0 576600
## R_15 1000 26200 91200 0
There are in total 15 chromatin states calculated by some tool. the first 8 are active states (classified into 8 sub states, e.g. active tss, active enhancer, …), then 6 repressive states and one null state (which means these regions has no function for transcription regulation). We assign colors to states.
Here circos.par() controls global parameters for the circular layout.
grid.col = c(rep("red", 8), rep("blue", 6), "black")
grid.col = c(grid.col, grid.col)
circos.par(gap.after = c(rep(1, 14), 10, rep(1, 14), 10))
chordDiagram(mat, annotationTrack = "grid", grid.col = grid.col)
circos.clear()
A complex visualization based on this dataset is at http://zuguang.de/circlize_book/book/a-complex-example-of-chord-diagram.html.
These is a more interesting package called HilbertCurve which provides a high resolution visualization of genomic data. You can go to https://bioconductor.org/packages/release/bioc/vignettes/HilbertCurve/inst/doc/HilbertCurve.html and http://zuguang.de/supplementary/HilbertCurve-supplementary1-6/index.html if you are interested.